{
  "cells": [
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "fodGF4u8srn7"
      },
      "source": [
        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/search/semantic-search/sparse/bm25/bm25-vector-generation.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/search/semantic-search/sparse/bm25/bm25-vector-generation.ipynb)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "7aAUyL65sx8S"
      },
      "source": [
        "# Generating hybrid Sparse-Dense embeddings using BM25 and sentence transformers\n",
        "\n",
        "[![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/fast-link.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/pinecone/sparse/bm25/bm25-quora.ipynb)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "yTJM3Al3srTj"
      },
      "source": [
        "## Overview\n",
        "\n",
        "[BM25](https://en.wikipedia.org/wiki/Okapi_BM25) is a popular technique for retrieving text. It uses term frequencies to determine the relative importance of the term to the query. It is simple but effective and only requires knowing the number of documents in a corpus and the frequency of terms across documents. In the following guide, we will show how to use BM25 with Pinecone's sparse-dense vectors for use in hybrid search.\n",
        "\n",
        "Skip the embedding creation step by using the [companion guide](https://github.com/pinecone-io/examples/blob/master/pinecone/sparse/bm25/bm25-quora.ipynb)."
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "0aoUzVbetha4"
      },
      "source": [
        "## Prerequisites"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "RDnXwZIODsa7"
      },
      "source": [
        "We'll install the required libraries:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "eYRHcfJvtksh",
        "outputId": "67779834-d154-40ca-ed46-d647281e280a"
      },
      "outputs": [],
      "source": [
        "!pip install -qU \\\n",
        "    pinecone-text \\\n",
        "    sentence-transformers \\\n",
        "    torch \\\n",
        "    pandas==2.0.2 \\\n",
        "    datasets==2.12.0 \\\n",
        "    pyarrow"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "id": "ivP8NyWWttsJ"
      },
      "outputs": [],
      "source": [
        "from tqdm.auto import tqdm"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "LpQi7d6AuymO"
      },
      "source": [
        "## Quora Dataset\n",
        "\n",
        "First We'll load the popular [Quora dataset](https://huggingface.co/datasets/quora).\n",
        "The Quora dataset is composed of question pairs, and the task is to determine if the questions are duplications of each other (have the same meaning)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "id": "QyExjJg8u1Uq"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "Found cached dataset quora (/Users/amnoncatav/.cache/huggingface/datasets/quora/default/0.0.0/36ba4cd42107f051a158016f1bea6ae3f4685c5df843529108a54e42d86c1e04)\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "6fc8e2e2f62b4e55ab87a16208a6c603",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "  0%|          | 0/1 [00:00<?, ?it/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "from datasets import load_dataset\n",
        "import pandas as pd\n",
        "\n",
        "dataset = load_dataset(\"quora\")"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let's look on a single record in this dataset"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 206
        },
        "id": "yGShs5o7u5mz",
        "outputId": "d278c82a-18fe-4e06-dec4-3664e9d14fd3"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'questions': {'id': array([15, 16], dtype=int32),\n",
              "  'text': array(['How can I be a good geologist?',\n",
              "         'What should I do to be a great geologist?'], dtype=object)},\n",
              " 'is_duplicate': True}"
            ]
          },
          "execution_count": 4,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "raw_df = dataset[\"train\"].to_pandas()\n",
        "\n",
        "raw_df.iloc[7].to_dict()"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As you can see above, each record contains two questions and a binary label that indicates whether the questions duplicates of each other.\n",
        "\n",
        "Before we start to create embedding from our data, we need to process it. First, we need to convert the JSON column into distinct columns within our dataframe."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>is_duplicate</th>\n",
              "      <th>text</th>\n",
              "      <th>id</th>\n",
              "      <th>dq_id</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>False</td>\n",
              "      <td>What is the step by step guide to invest in sh...</td>\n",
              "      <td>1</td>\n",
              "      <td>2</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>False</td>\n",
              "      <td>What is the story of Kohinoor (Koh-i-Noor) Dia...</td>\n",
              "      <td>3</td>\n",
              "      <td>4</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>False</td>\n",
              "      <td>How can I increase the speed of my internet co...</td>\n",
              "      <td>5</td>\n",
              "      <td>6</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>False</td>\n",
              "      <td>Why am I mentally very lonely? How can I solve...</td>\n",
              "      <td>7</td>\n",
              "      <td>8</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>False</td>\n",
              "      <td>Which one dissolve in water quikly sugar, salt...</td>\n",
              "      <td>9</td>\n",
              "      <td>10</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "   is_duplicate                                               text  id  dq_id\n",
              "0         False  What is the step by step guide to invest in sh...   1      2\n",
              "1         False  What is the story of Kohinoor (Koh-i-Noor) Dia...   3      4\n",
              "2         False  How can I increase the speed of my internet co...   5      6\n",
              "3         False  Why am I mentally very lonely? How can I solve...   7      8\n",
              "4         False  Which one dissolve in water quikly sugar, salt...   9     10"
            ]
          },
          "execution_count": 5,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "raw_df[\"q1\"] = raw_df[\"questions\"].apply(lambda q: q[\"text\"][0])\n",
        "raw_df[\"q2\"] = raw_df[\"questions\"].apply(lambda q: q[\"text\"][1])\n",
        "raw_df[\"id1\"] = raw_df[\"questions\"].apply(lambda q: q[\"id\"][0])\n",
        "raw_df[\"id2\"] = raw_df[\"questions\"].apply(lambda q: q[\"id\"][1])\n",
        "\n",
        "q1_to_q2 = raw_df.copy().rename(columns={\"q1\": \"text\", \"id1\": \"id\", \"id2\": \"dq_id\"}).drop(columns=[\"questions\", \"q2\"])\n",
        "q2_to_q1 = raw_df.copy().rename(columns={\"q2\": \"text\", \"id2\": \"id\", \"id1\": \"dq_id\"}).drop(columns=[\"questions\", \"q1\"])\n",
        "flat_df = pd.concat([q1_to_q2, q2_to_q1])\n",
        "\n",
        "flat_df.head()"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Next, we take a sample of the data and retaining just one row per question while storing a list of duplicated question IDs for each question."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>text</th>\n",
              "      <th>id</th>\n",
              "      <th>duplicated_questions</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>2003</th>\n",
              "      <td>What name will you give to a mathematics educa...</td>\n",
              "      <td>3985</td>\n",
              "      <td>[3986, 535562]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2004</th>\n",
              "      <td>What universities does Federal-Mogul recruit n...</td>\n",
              "      <td>3987</td>\n",
              "      <td>[3988, 23638]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2005</th>\n",
              "      <td>Why do people go for mba after masters in engi...</td>\n",
              "      <td>3989</td>\n",
              "      <td>[3990, 161829, 316113, 168810]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2006</th>\n",
              "      <td>How does a facebook account get hacked?</td>\n",
              "      <td>3991</td>\n",
              "      <td>[3992, 379602, 35699]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2007</th>\n",
              "      <td>Why do black people get upset when non-black r...</td>\n",
              "      <td>3993</td>\n",
              "      <td>[3994]</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "                                                   text    id   \n",
              "2003  What name will you give to a mathematics educa...  3985  \\\n",
              "2004  What universities does Federal-Mogul recruit n...  3987   \n",
              "2005  Why do people go for mba after masters in engi...  3989   \n",
              "2006            How does a facebook account get hacked?  3991   \n",
              "2007  Why do black people get upset when non-black r...  3993   \n",
              "\n",
              "                duplicated_questions  \n",
              "2003                  [3986, 535562]  \n",
              "2004                   [3988, 23638]  \n",
              "2005  [3990, 161829, 316113, 168810]  \n",
              "2006           [3992, 379602, 35699]  \n",
              "2007                          [3994]  "
            ]
          },
          "execution_count": 6,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "df = flat_df.drop_duplicates(\"id\").head(2000)\n",
        "df[\"duplicated_questions\"] = df[\"id\"].apply(lambda qid: flat_df[flat_df[\"id\"] == qid][\"dq_id\"].tolist())\n",
        "df = df.drop(columns=[\"dq_id\", \"is_duplicate\"])\n",
        "\n",
        "df.tail()"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "H2x-1gK37Zo7"
      },
      "source": [
        "### Fit BM25 with Spacy Tokenizer\n",
        "\n",
        "Here We'll create and fit a BM25 model using a simple split by space tokenization.\n",
        "\n",
        "Since currently all available Python libraries for BM25 are also indexing the documents, we will use a simple and efficient hash-based encoding implementation from the `pinecone-text` library."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "metadata": {
        "id": "KR75WORb7kga"
      },
      "outputs": [],
      "source": [
        "from pinecone_text.sparse import BM25Encoder\n",
        "\n",
        "bm25 = BM25Encoder()"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "2dLrjpXb7nt2"
      },
      "source": [
        "We need to calculate how often tokens appear in documents for BM25 to be able to create sparse vectors. To do this we call `bm25.fit` across our full dataset."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "inEUqZcC7tf8",
        "outputId": "c1e93066-e8fc-46b4-8a7c-eb4dfd7c95b4"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "5d78e439abc442e0b375344f90ef37b2",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "  0%|          | 0/2000 [00:00<?, ?it/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/plain": [
              "<pinecone_text.sparse.bm25_encoder.BM25Encoder at 0x149713df0>"
            ]
          },
          "execution_count": 8,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "bm25.fit(df['text'])"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "gO5BYxOs7wSt"
      },
      "source": [
        "\n",
        "### Dense Model\n",
        "\n",
        "We use the popular [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model available on Hugging Face for dense vectors."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 9,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 483,
          "referenced_widgets": [
            "01c5e61ebfd1415da3f92c6df701851e",
            "ed82e1e36af748afb86db14b41dd68c2",
            "6b2133a7f53a4b2ab02278cb22fc48e1",
            "353cc7b3ce54470f8590940dcefe5c7f",
            "cddd40a9e6024baab4b95d974c815553",
            "5f44fda21e0049509faccc7c2ef8bf6a",
            "39c18b6315c648d295a4c2fac14e5ac2",
            "4ae89fee1276497fa87c718f340754a9",
            "c64e346feef1499bb275501b73a1e1e5",
            "14bac3cc3ae2469dbad8932a372727a0",
            "582fdefeb3dd47c691381275dae6b552",
            "038ed6dfeccd4c77a5fb652cde4d8c68",
            "1bb6fbc82d944a80b06733a8880ea31e",
            "ed6e0124fe844dbabe8fc07112ce3ee6",
            "d2a4b52aa22f4d4da6597fe4e3b6ee15",
            "10f2c5c85eaf498d87bc11d0377af9d7",
            "877f9a66fbde4ee2951bca8321ec5321",
            "70e5def84a834550a4b481cdd3e551a3",
            "75489a3fcbcc4b64ba1086ba1553662c",
            "3b15e76c863445e79b60ac4ffd70a9d8",
            "68246c686d3d4a858d23300792799a9d",
            "98fb0aa882604785863bd205d90ce37a",
            "87173bd5cf824222aab1a903f8589800",
            "fbadac48b05c48a4a9f109e6a8aba52e",
            "604e3d064d074771bdb48269b76491a5",
            "9065d4e26cb048a3a598c68f191e19cb",
            "e8a66565ff1f47498130fbccba5e360d",
            "020c65b0dd194d26a4cfabc1c7022345",
            "841d1789cb6f4a15bd6515876f970b40",
            "b9d8d0fa65bf4691bce683632fd9c665",
            "3ebeda421bc4449ebfe065967f1e98e1",
            "9a35d22c2c03400c8dfe5775234c8d66",
            "75fc032df0f6471f81b31693663e9026",
            "61904fcd9b0a4a13a0bb61110424f5bd",
            "db9da551de584b3eae77b56d58017da7",
            "5c3e541273194b6683d3a19c8411c804",
            "59e70907ecea4982bfc9a8c947e60435",
            "c48bac69d9804ae6b714c4e4e2389e49",
            "ebc2366c91f84a9e8ed79f05a5732991",
            "91d0da40caca48bca0cc781093b81127",
            "ec9522f7e48141d494490936d43a3229",
            "0e2582a061a14f68a8df90aa8c3e1014",
            "ae5a4960c15845cfad5e4fc39441cc66",
            "c065ccaebf26459296949ee21f9002d4",
            "a491b0ad9452451c97694c31d88bec6e",
            "a850fbb2f23141b78d673ddcf50951c6",
            "60e6c35a97b84bbdb67de1ebeda1a8c6",
            "719a5e30ab824f149d738314e60db040",
            "8938a373db4548b18b57b6cccd13997a",
            "687a368cbe9b4853b13e4f99adebcedc",
            "c6f615c61734419bb84eab7e60fd8813",
            "afa89967f64a4216bab3e9b22148d3d0",
            "7b18920ee96b4b27a74941fb59ce800c",
            "6b6eb7ad61b0493ba796b6f686555367",
            "1f686c50cc3d484dbea71c4996ac3ada",
            "3e59832ad9fb49298eea2c5408b28a6b",
            "0c6f0f1a9acc44fcb121aa8041333993",
            "d7d6cf21b0ad4d6db1905d031e6afc78",
            "af710fa5078a42c19b6c2b4bc850a509",
            "b749a0f8f7a1438186c3067235b68187",
            "3646c47354184361b79bfbc01feda20b",
            "b31b5e992a884323a73f404358a06bc0",
            "de62b165caa341669111cc29b8f53c5b",
            "a99ee327d4574137a996f93c57444195",
            "fc9ad59d6990407aaef3fc6333031b89",
            "77fd1bfb28584c479cab6a781c7a7fe0",
            "747a652803034b67a80af09c73a87bca",
            "1c29e9cc86224b14a45cdce60b45c73a",
            "6fb7a32aeb864956b3d490bacb07b620",
            "c3f472331ff64482aecab4f5cc668e1a",
            "f27b3a5d8ac64dd1ac9ce79ef2d983f2",
            "eca628ab858548238a71ba9affb9795d",
            "92c1e228091d40979d13d42cd2b8f129",
            "cf1f760872a948bbb710e821ca3430bc",
            "fe852b69273a4513a62f77361e32d8ad",
            "7e885bd82afb451887b20a7af6d474e3",
            "38f9d13d8c4f47f2a4ce5d140a136d50",
            "24ac206c0ce342f19458b3b5e0881c33",
            "925fd40fa80a465ea29d46b62e424a9c",
            "e94cf6aea7b148a18c3bbcda93e80d4d",
            "374f38c249ad479e864c1afd203c0cc7",
            "5456774969d84755b94113ef601029f5",
            "666c10310c234a06b542511127edd4d6",
            "5c68fb47656745eb918bdb2296118b3f",
            "599a35de46644ff7b2842b6b6a4e71c9",
            "3b734ecfdbd34575a51f14d6c6426a56",
            "0b333c60159e47a3810c2da9468c9d3d",
            "8214fbe8b366401fa438aa940d5e3584",
            "13f5bdde51f84bddb89f464b3122244f",
            "14ef1df14c124ce781b0ac50e53104ee",
            "3662a913fc004ee58627c4e382e8e9da",
            "808e4be48d1149bb8aeecd927c72a783",
            "4d9d8cae4f0d480c800538b2bdef2cd2",
            "82d62bc023ce4f99a095ae2938d3ac0d",
            "3521b6050d46449bbd1f19a612ebcf45",
            "adf47c9b6d0c47269f5344c64ff5c835",
            "22c09a27002c421b9af5beb4f4fc955c",
            "b9911a8530dc4fa99ca1c39af0d3718c",
            "016d5b0f086544f4a065afcdcbb7abe2",
            "c42fdf378dbf4af5b8bc8e8dafe12660",
            "95e43ed1e11542b8bdb5bf020898df02",
            "dd6cda4cd0914dad8b03a7a8238fd88d",
            "547fab0a4c834ae38ac59c20173cb614",
            "7489d1f9dd0a484fbb1a7464ab9d63c6",
            "5f2442b0491a425da748b3f70f694194",
            "b1cace8f0b464b9b9e8c9dec5ce3c1df",
            "3f33e610c3b6476989805ec9c48f165e",
            "3159a8f86d794d33aa58aa9f21eaab9c",
            "b6e7f3b7926a454d962d2627710d5c5d",
            "c98a40e67bbb4b2c80289e3e10042d66",
            "6fa9ce872a51457baf3ec3388f5a0265",
            "b4a1add99ba44c0fa956f1afa41bd096",
            "de71aed3bd5e41eeb070324377916544",
            "22582e3fccf6493daf1ac7f9f37c474c",
            "3c8b6ff0a02642789079d2ed1ef27a57",
            "4b4e197fcedf4da6af2aa6989aba913b",
            "d3eefbb7bcf2478a88e7653a0062899f",
            "41a1c656f3d64c1cbe38d0e49b94c296",
            "cbfbca4cad944021b506b6f2eb6e1778",
            "0b2c8196a1d04412aa159fe8d40cae0d",
            "f9add21d20fc4fbebae9cca5a731ab6f",
            "f0be6ed0e82c4f05b26fe7cb4fca0b90",
            "28fc681c385b49cb956b260db2338614",
            "a416b3e082e44ae9b7379d03fc3816fa",
            "360cd301ab7241329d2a580801ebf46f",
            "2380df51e9954a56965e5c284e35b53d",
            "ed44fb559c7c48ebb2957b3fad123385",
            "a702e2e18c824b4587b52517d79a586a",
            "a029ae0f365744e7ab7ac0d67f841393",
            "b15bea9236c1449e80201983111e0db3",
            "c8c71871044840bc98cb134384d22001",
            "df021819654540e799af56ce080d540c",
            "5d1c7dc8aa4b4460a1f9dd9d3cb9a2f9",
            "65336fa6eb1c429181289e8c37c10fd7",
            "17e7198232b544e18051e26416a06786",
            "782adef4d4d04491bc852d8fd5ce5342",
            "a3d298f7c70444e59c41a69d3d55aeba",
            "466042fc212a488b8533847e6a52efcc",
            "73dd5a27b2bc4ba69595dd958953c4c2",
            "275c63ae830b4ec4ab7aacdf870feb3c",
            "22441a7932994fcc92d2edea6e9aaa6a",
            "9db255c713ce4203aaa13e712879d96e",
            "6cb03641ef0449709808db1f42cc0e5c",
            "83e409c8c74946a2a63b29875ad799a0",
            "dc04f57b25a8430384a246cbb3e6cf3b",
            "99f7dacae40b443e93902c5fda720e7f",
            "a70e446faa6846468a8035968556d522",
            "dd4fd6dab7b440a485fa68f1b8bd4a80",
            "8946699ddeb1426091e3672405c253a5",
            "40f41b8f7b754eaea25233d17b6e0218",
            "6402545355f243d8b26fde2d262d018f",
            "2f9478abf95949e98db1242e920f2b0d",
            "d5a45ce8c5be44738cf4943e2a21386e",
            "76cedb21c705470e9f7c622afcec06d2"
          ]
        },
        "id": "JTnCqvC67yRl",
        "outputId": "323a18e6-14bc-478d-a5f2-1bdb798b0f22"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "running on cpu\n"
          ]
        }
      ],
      "source": [
        "from sentence_transformers import SentenceTransformer\n",
        "import torch\n",
        "\n",
        "device = 'cuda' if torch.cuda.is_available() else 'cpu'\n",
        "print(f\"running on {device}\")\n",
        "\n",
        "model = SentenceTransformer(\n",
        "    'sentence-transformers/all-MiniLM-L6-v2',\n",
        "    device=device\n",
        ")"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "FTmI-MtQ70L_"
      },
      "source": [
        "### Compute Dense & Sparse Embeddings\n",
        "\n",
        "Create BM25 sparse embeddings:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 10,
      "metadata": {
        "id": "0dZAv1gU79x1"
      },
      "outputs": [],
      "source": [
        "df['sparse_values'] = df['text'].apply(bm25.encode_documents)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "085BqCTQDsa-"
      },
      "source": [
        "And now encode dense vector embeddings:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 11,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 49,
          "referenced_widgets": [
            "529013226c534ca69da1f0ccd2f15a91",
            "8ea029b1bd914329a3ca9af6077aa46d",
            "39fdb55b42414deeb073eee9be59e673",
            "3175e867c1bf4e70bd66a4fc88a4af80",
            "662da2ba4c344c75aa04c90f25832a65",
            "2d8b40c2f558454988b103dc7c49034b",
            "be0234ac46b34d4fb871d741fdd320ce",
            "a06b883c746e48c7945a45fc11c2914f",
            "2985faffabc64d9fa4972daefd5d1a75",
            "631d41ea09754f44bbdf252604a484c8",
            "aa216bd5e93e427d82b6f3e7d9834928"
          ]
        },
        "id": "7LEwg_JM8N6j",
        "outputId": "e9b73fd4-be94-4701-c85a-004611bf1365"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "8790441c2ced451fb3ad1cc37cd13aa8",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "  0%|          | 0/16 [00:00<?, ?it/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "batch_size = 128\n",
        "dense_values = []\n",
        "for i in tqdm(range(0, len(df), batch_size)):\n",
        "  dense_values += model.encode(df.iloc[i:i + batch_size][\"text\"].tolist()).tolist()\n",
        "\n",
        "df['values'] = dense_values"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "b6IGmM5hGVIt"
      },
      "source": [
        "We organize our dataframe to align to the [pinecone-datasets](https://pypi.org/project/pinecone-datasets/) format:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 12,
      "metadata": {
        "id": "hwirAGEYFh3v"
      },
      "outputs": [],
      "source": [
        "df_result = df.copy()\n",
        "df_result[\"metadata\"] = None\n",
        "df_result[\"blob\"] = df_result.apply(lambda r: {\"text\": r[\"text\"], \"duplicated_questions\": r[\"duplicated_questions\"]}, axis=1)\n",
        "df_result = df_result.drop(columns=\"text\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 13,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 250
        },
        "id": "6SudYEYOHR3b",
        "outputId": "9267eef8-b716-4db3-dda5-697e17c5918b"
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>id</th>\n",
              "      <th>duplicated_questions</th>\n",
              "      <th>sparse_values</th>\n",
              "      <th>values</th>\n",
              "      <th>metadata</th>\n",
              "      <th>blob</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>1</td>\n",
              "      <td>[2]</td>\n",
              "      <td>{'indices': [2386266869, 3085233890, 550510908...</td>\n",
              "      <td>[0.06814990937709808, -0.03966414928436279, -0...</td>\n",
              "      <td>None</td>\n",
              "      <td>{'text': 'What is the step by step guide to in...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>3</td>\n",
              "      <td>[4, 282170, 380197, 488853]</td>\n",
              "      <td>{'indices': [1358668261, 1991459541, 930853874...</td>\n",
              "      <td>[-0.04679807275533676, 0.15511494874954224, -0...</td>\n",
              "      <td>None</td>\n",
              "      <td>{'text': 'What is the story of Kohinoor (Koh-i...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>5</td>\n",
              "      <td>[6]</td>\n",
              "      <td>{'indices': [3570117707, 997512866, 3217550068...</td>\n",
              "      <td>[-0.02832488901913166, 0.037209585309028625, -...</td>\n",
              "      <td>None</td>\n",
              "      <td>{'text': 'How can I increase the speed of my i...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>7</td>\n",
              "      <td>[8]</td>\n",
              "      <td>{'indices': [3705065656, 97715094, 930102859],...</td>\n",
              "      <td>[0.06325336545705795, -0.05639313533902168, 0....</td>\n",
              "      <td>None</td>\n",
              "      <td>{'text': 'Why am I mentally very lonely? How c...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>9</td>\n",
              "      <td>[10, 109465, 480204]</td>\n",
              "      <td>{'indices': [2257684172, 1522609561, 918197339...</td>\n",
              "      <td>[-0.04876847192645073, -0.025538966059684753, ...</td>\n",
              "      <td>None</td>\n",
              "      <td>{'text': 'Which one dissolve in water quikly s...</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "   id         duplicated_questions   \n",
              "0   1                          [2]  \\\n",
              "1   3  [4, 282170, 380197, 488853]   \n",
              "2   5                          [6]   \n",
              "3   7                          [8]   \n",
              "4   9         [10, 109465, 480204]   \n",
              "\n",
              "                                       sparse_values   \n",
              "0  {'indices': [2386266869, 3085233890, 550510908...  \\\n",
              "1  {'indices': [1358668261, 1991459541, 930853874...   \n",
              "2  {'indices': [3570117707, 997512866, 3217550068...   \n",
              "3  {'indices': [3705065656, 97715094, 930102859],...   \n",
              "4  {'indices': [2257684172, 1522609561, 918197339...   \n",
              "\n",
              "                                              values metadata   \n",
              "0  [0.06814990937709808, -0.03966414928436279, -0...     None  \\\n",
              "1  [-0.04679807275533676, 0.15511494874954224, -0...     None   \n",
              "2  [-0.02832488901913166, 0.037209585309028625, -...     None   \n",
              "3  [0.06325336545705795, -0.05639313533902168, 0....     None   \n",
              "4  [-0.04876847192645073, -0.025538966059684753, ...     None   \n",
              "\n",
              "                                                blob  \n",
              "0  {'text': 'What is the step by step guide to in...  \n",
              "1  {'text': 'What is the story of Kohinoor (Koh-i...  \n",
              "2  {'text': 'How can I increase the speed of my i...  \n",
              "3  {'text': 'Why am I mentally very lonely? How c...  \n",
              "4  {'text': 'Which one dissolve in water quikly s...  "
            ]
          },
          "execution_count": 13,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "df_result.head()"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "aY0g3W2DhNK0"
      },
      "source": [
        "And now we have all we need to start using Pinecone vector database \ud83d\ude80\n",
        "\n",
        "For more details on that, check out [this notebook](https://github.com/pinecone-io/examples/blob/master/pinecone/sparse/bm25/bm25-quora.ipynb)."
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": ".venv",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.10"
    },
    "vscode": {
      "interpreter": {
        "hash": "57376684f67c5d7b1589c855d7d0f1a1bdf8944ab1b903e711fdbf39434567bb"
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}